## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
The white wine data set has information on 4898 wines that were graded by wine experts. It describes each wine’s acidity, sugar concentration, pH, alcohol concentration, etc. The 12th variable is the quality of each wine, graded by experts from 0 (bad) to 10 (excellent).
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
The first column is just a column index so I took it out. Let’s take a look at the distributions of the variables.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
The minimum and maximum wine ratings are 3 and 9, respectively. The most common rating is 6. Very few wines received ratings of 3 or 9. Let’s take a look at the other variables.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
I log-transformed the variable to better visualize its distribution. Most wines have a volatile acidity (amount of acetic acid) of around 0.3. The white wine document says that too much acetic acid can lead to an unpleasant vinegar taste. Will I observe an inverse relationship between wine rating and volatile acidity later on?
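Here is a minimal sketch of how a log-scale histogram like this could be drawn in base R. The helper name `plot_log_hist` is hypothetical (the plotting code from the original analysis is not shown), and the ten values are just the preview printed by `str()` above, so the picture is only illustrative.

```r
# Hypothetical helper (not from the original analysis): histogram of a
# strictly positive variable on a log10 scale, using base R graphics.
plot_log_hist <- function(x, label, breaks = 30) {
  stopifnot(is.numeric(x), all(x > 0))  # log10 is undefined for values <= 0
  log_x <- log10(x)
  hist(log_x, breaks = breaks,
       main = paste0("Distribution of log10(", label, ")"),
       xlab = paste0("log10(", label, ")"))
  invisible(log_x)  # return the transformed values without printing them
}

# First ten volatile acidity values, copied from the str() output above.
volatile_preview <- c(0.27, 0.30, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27, 0.30, 0.22)
log_values <- plot_log_hist(volatile_preview, "volatile.acidity", breaks = 5)
```

The positivity check matters for this data set: citric acid has a minimum of 0, so it cannot be log-transformed without an offset.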
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Seems to peak around 0.3 with a few outliers to the right. It is said that citric acid can add freshness and flavor to wines. I’m interested to see the relationship between wine rating and this variable as well.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
I log-transformed the variable and cut off outliers for better visualization. The distribution appears bimodal, with peaks around 1.7 and 8.5. There’s an extreme outlier (65.8 g/dm^3 of sugar). I’m definitely interested to see the general relationship between wine quality and sugar concentration.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Log-transformed for better visualization. Chlorides seem to peak around 0.044.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
I square-root-transformed the variable for better visualization. Total sulfur dioxide peaks around 130 and has a pretty extreme outlier (440.0). Sulfur dioxide (SO2) prevents microbial growth and the oxidation of wine. The white wine document says that a free SO2 concentration of over 50 ppm can be detected in the nose and taste of wine. I’m interested in seeing how this variable affects wine rating as well.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
All wines fall between about 2.7 and 3.8 on the pH scale. Since the range is fairly narrow, I don’t think it will influence wine rating by much, but we’ll see.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
Seems to peak around 0.5. Sulphate is a wine additive that can contribute to SO2 levels. I expect this to correlate quite strongly with SO2 level. Would they also have similar effects on wine rating, if any?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Alcohol is distributed across a fairly large range (8 to 14.2 %). How does alcohol affect wine quality?
The data has 4898 observations of 12 variables. 11 of these variables represent properties of a wine such as acidity, sugar concentration, pH, alcohol concentration, etc. The 12th variable is the quality of each wine, graded by experts from 0 (bad) to 10 (excellent). All of the variables are continuous except for wine quality. Most of them are unimodal and a few of them have outliers.
The main features of interest are which variable(s) influence wine quality/rating significantly and how the changes in these variables influence the quality of wine.
I log-transformed and square-root-transformed several variables to make their skewed distributions less skewed, so that the patterns in the data are easier to see.
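As a rough check of that claim, here is a small sketch that uses only the five-number summaries printed above. The skew indicator (upper tail length divided by the interquartile range) is my own hypothetical choice, not something from the original analysis. Because log and square root are monotone, transforming the printed quantiles gives (up to interpolation) the quantiles of the transformed data.

```r
# Five-number summaries copied from the summary() output above.
residual_sugar <- c(min = 0.6, q1 = 1.7, median = 5.2, q3 = 9.9, max = 65.8)
total_so2      <- c(min = 9, q1 = 108, median = 134, q3 = 167, max = 440)

# Hypothetical skew indicator: length of the upper tail (max - Q3)
# relative to the interquartile range (Q3 - Q1).
upper_tail_ratio <- function(s) {
  unname((s["max"] - s["q3"]) / (s["q3"] - s["q1"]))
}

sugar_raw <- upper_tail_ratio(residual_sugar)
sugar_log <- upper_tail_ratio(log10(residual_sugar))
so2_raw   <- upper_tail_ratio(total_so2)
so2_sqrt  <- upper_tail_ratio(sqrt(total_so2))

print(round(c(sugar_raw = sugar_raw, sugar_log = sugar_log,
              so2_raw = so2_raw, so2_sqrt = so2_sqrt), 2))
```

In both cases the ratio shrinks after the transformation, meaning the long right tail takes up less of the plot and the bulk of the distribution becomes easier to read.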
# Changing wine quality from integer to factor.
w$quality <- as.factor(w$quality)
Here I change the variable “quality” to factor so I can create proper boxplots.
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
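One pitfall with this conversion is getting numeric ratings back out of the factor, which later code needs (it refers to `w$numQuality`, presumably a numeric copy). The sketch below rebuilds a stand-in for the quality column from the counts printed above. The variable names are hypothetical.

```r
# Counts per rating, copied from the table() output above.
rating_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                   "7" = 880, "8" = 175, "9" = 5)

# Rebuild a stand-in for the quality column from those counts.
quality_int <- rep(as.integer(names(rating_counts)), times = rating_counts)
quality_fac <- as.factor(quality_int)

# as.numeric() on a factor returns level codes (1..7), not the ratings,
# so go through as.character() to recover the original scale.
level_codes <- as.numeric(quality_fac)
num_quality <- as.numeric(as.character(quality_fac))

print(table(quality_fac))
print(range(level_codes))           # level codes, not ratings
print(round(mean(num_quality), 3))  # mean rating on the original scale
```

Keeping a numeric copy before converting to a factor avoids the issue entirely.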
It seems to me that wines that received ratings of 5 and higher have a fairly narrow range of volatile acidity compared to those that received ratings of 3 or 4. This makes sense because volatile acidity (amount of acetic acid) is said to be unpleasant at high levels. What’s important to note here and throughout the analysis is that only five wines received a rating of 9, so it may be difficult to tell what the true distribution for that rating looks like.
I cut off the outliers to visualize the distribution better. Right off the bat I can see there isn’t a simple, linear relationship between sugar and wine quality. For wines with ratings of 5 or higher, however, wine quality seems to increase as sugar content decreases. Since the overall trend isn’t linear, there may be other variables that are influencing wine quality here.
## [1] -0.116647
As expected, there is a weak negative correlation (using Spearman) of -0.117 between wine quality and sugar concentration for wines that received ratings of 5 or higher. Here and throughout the analysis, Spearman correlation is used because it is less sensitive to outliers and does not assume a normal distribution.
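To illustrate the outlier argument, here is a minimal sketch with made-up toy values (not the wine data). The helper `spearman_by_rating` is hypothetical. It restricts the data to ratings at or above a threshold and then calls `cor()` with `method = "spearman"`.

```r
# Hypothetical helper: Spearman correlation between a numeric rating and
# another variable, restricted to wines rated at or above min_rating.
spearman_by_rating <- function(rating, x, min_rating = 5) {
  keep <- rating >= min_rating
  cor(rating[keep], x[keep], method = "spearman")
}

# Made-up toy values (NOT from the wine data) with one extreme outlier.
toy_rating <- c(4, 5, 6, 7, 8, 9)
toy_sugar  <- c(3, 100, 4, 3, 2, 1)

rho     <- spearman_by_rating(toy_rating, toy_sugar)
keep    <- toy_rating >= 5
pearson <- cor(toy_rating[keep], toy_sugar[keep], method = "pearson")
print(c(spearman = rho, pearson = pearson))
```

Because Spearman works on ranks, the single extreme value does not change the result as long as the ordering is preserved, while the Pearson coefficient is pulled away from a perfect relationship.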
## [1] -0.3144885
The boxplot seems to indicate that higher quality wines tend to have a lower salt (chlorides) concentration. As the boxplot suggests, there is a moderate negative correlation of -0.314 between salt concentration and wine quality.
## [1] -0.1966803
The boxplot and the correlation coefficient show that a weak negative relationship exists between wine quality and total sulfur dioxide. This makes sense because free sulfur dioxide, which is part of the total, is said to be detectable by taste and nose, and unpleasant, at concentrations above 50 ppm.
## [1] 0.03331897
I don’t see a noteworthy trend here.
## [1] 0.4403692
Alcohol shows the strongest positive correlation with wine quality seen so far. The correlation is even more apparent in the boxplot for wines that have ratings of 5 or greater. Is it just alcohol that is influencing wine quality, or is alcohol correlated with other features that also influence wine quality? To answer this, let’s look at a correlation matrix of all features.
From the correlation matrix, you can tell that alcohol is negatively correlated with residual.sugar, chlorides, and total.sulfur.dioxide. This means that wines with higher alcohol concentration tend to have less sugar, salt, and total sulfur dioxide. All of these variables are also negatively correlated with wine quality (“numQuality”), so wines that have low sugar, salt, and total sulfur dioxide are more likely to be high quality wines. Therefore, alcohol concentration may show the strongest positive correlation with wine quality simply because wines with high alcohol concentration tend to have low sugar, salt, and total sulfur dioxide.
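A Spearman correlation matrix like the one discussed here can be computed with a single `cor()` call. The sketch below runs it on the ten preview rows printed by `str()` at the top of the report, so the coefficients it prints are not the ones discussed in the text. It only shows the mechanics.

```r
# First ten rows of four columns, copied from the str() output above.
# Ten rows are only a preview, so these coefficients are NOT the ones
# reported for the full data set.
preview <- data.frame(
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  chlorides            = c(0.045, 0.049, 0.05, 0.058, 0.058, 0.05, 0.045,
                           0.045, 0.049, 0.044),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Spearman correlation between every pair of columns.
cor_matrix <- cor(preview, method = "spearman")
print(round(cor_matrix, 2))

# Row of interest: how alcohol relates to the other variables.
print(round(cor_matrix["alcohol", ], 2))
```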
Volatile acidity, sugar, salt, and total sulfur dioxide are negatively correlated with wine quality. Alcohol, on the other hand, is positively correlated with wine quality.
Alcohol is negatively correlated with all of the other variables mentioned above (volatile acidity, sugar, salt, and total sulfur dioxide)!
In terms of correlation, alcohol had the strongest relationship with wine quality.
In the last section, we found variables that are correlated with wine quality. Now I’m curious to see how wine ratings are distributed among combinations of these variables (total sulfur dioxide, chlorides, volatile acidity, residual sugar, alcohol). Before proceeding, let’s take a look at the correlation matrix one more time to make sure we haven’t missed anything important.
Besides the five variables we took note of, there is one more variable that shows a weak negative correlation with wine quality: fixed acidity. I’ll include this variable for exploration in this section.
w$newQualityLevel = cut(w$numQuality, breaks = c(0, 4, 7, 10))
For the following section, I divided the wine quality ratings (on the 0 to 10 scale) into 3 intervals, (0, 4], (4, 7], and (7, 10], to better visualize patterns in the data. The variable newQualityLevel stores this information. I’ll refer to wines in the range (7, 10] as “high quality” wines, wines in the range (4, 7] as “medium quality” wines, and wines in the range (0, 4] as “low quality” wines.
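To see how many wines land in each group, the sketch below applies the same `cut()` call to a stand-in rating vector rebuilt from the counts in the `table()` output earlier in the report. The variable names are hypothetical.

```r
# Counts per rating, copied from the table() output earlier in the report.
rating_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                   "7" = 880, "8" = 175, "9" = 5)
num_quality <- rep(as.numeric(names(rating_counts)), times = rating_counts)

# cut() builds right-closed intervals by default: (0,4], (4,7], (7,10].
quality_level <- cut(num_quality, breaks = c(0, 4, 7, 10))
group_sizes <- table(quality_level)
print(group_sizes)

# A rating of exactly 0 would fall outside (0,4] and become NA unless
# include.lowest = TRUE is set; the observed minimum here is 3.
print(cut(0, breaks = c(0, 4, 7, 10)))
print(cut(0, breaks = c(0, 4, 7, 10), include.lowest = TRUE))
```

The groups are very unbalanced: the medium group holds the vast majority of wines, which is worth keeping in mind when comparing the distributions visually.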
Here I’m looking at the distributions of wines of varying qualities across total sulfur dioxide and chlorides concentration. High quality wines are mostly distributed from around 80 to 180 mg/dm^3 of total sulfur dioxide. Medium quality wines are distributed a bit more widely from around 70 to 250 mg/dm^3. Low quality wines are distributed from 50 to 200 mg/dm^3. There doesn’t seem to be notable separation of wine qualities across chlorides concentrations.
There doesn’t seem to be a notable separation of wine qualities across either residual sugar concentration or volatile acidity.
It appears that the majority of high quality wines have an alcohol concentration of 10 - 13 %, while lower quality wines range more evenly from 8.5 - 13 %. There doesn’t seem to be a notable separation of wine qualities across fixed acidity.
Out of the six variables we looked at, only alcohol and total sulfur dioxide seem to have visually distinct distributions of wine ratings. Let’s look at both of these variables in a single plot.
As expected, differences in the distributions of wine ratings are visible across the two variables.
Seeing how total sulfur dioxide influences wine distributions, I’m curious about free sulfur dioxide’s influence on wine distributions. The white wine documentation says that free sulfur dioxide (SO2) prevents microbial growth and the oxidation of wine. It also says that SO2 concentration of over 50 ppm becomes evident in the nose and the taste. Let’s see how free SO2 concentration influences the distributions of wine ratings.
There doesn’t seem to be notable separation of wine qualities. I wonder if the ratio of free SO2 to total SO2 concentration may tell a better story. Let’s find out.
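Computing the ratio is a one-liner. Here is a minimal sketch on the ten preview rows printed by `str()` at the top of the report. The column name `so2.ratio` is hypothetical, since the original name is not shown.

```r
# First ten rows of the two SO2 columns, copied from the str() output.
preview <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)
)

# so2.ratio is a hypothetical column name; the original name is not shown.
preview$so2.ratio <- preview$free.sulfur.dioxide / preview$total.sulfur.dioxide
print(round(preview$so2.ratio, 3))
```

Since free SO2 is part of total SO2, the ratio should always lie between 0 and 1, which makes it a handy sanity check on the data.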
The pattern is clearer! High quality wines are mostly distributed from around 0.15 to 0.4. Medium quality wines are distributed a bit more widely from around 0.1 to 0.4. Low quality wines seem to be mostly distributed from 0.04 to 0.3.
There were three features that strengthened each other: alcohol, total sulfur dioxide, and the ratio of free SO2 to total SO2. The distribution of the highest quality wines separated from the rest of the wine groups in that most high quality wines had around 10 - 13 % alcohol, while many medium and low quality wines had well below 10 % alcohol. The distribution of high quality wines was also distinct from that of lower quality wines in terms of total sulfur dioxide concentration and the ratio of free SO2 to total SO2 concentration, although not as distinctly as for alcohol.
The median, 1st quartile, and 3rd quartile of alcohol content all rise steadily as we move from a quality rating of 5 to 9. This indicates that high quality wines are more likely to have a higher alcohol concentration.
Similar to the first plot, the plot above shows that high quality wines (wines of ratings 8 and 9) tend to have a higher alcohol concentration than lower quality wines. Specifically, the plot shows that high quality wines tend to have an alcohol concentration of 10.5 to 13.5 %. Medium quality wines (wines of ratings from 5 to 7) have around 8.5 to 13 % alcohol and low quality wines (wines of ratings 3 and 4) have around 8.5 to 12 % alcohol. Fixed acidity, which does not have a big impact on wine quality distributions, was chosen as the y-axis to make the wine quality distributions across alcohol concentration stand out.
The plot above shows that high quality wines (wines of ratings 8 and 9) are mostly distributed from 0.15 to 0.4 SO2 ratio (free sulfur dioxide / total sulfur dioxide). Medium quality wines (wines of ratings from 5 to 7) are distributed more widely from 0.07 to 0.42 SO2 ratio. Low quality wines (wines of ratings 3 and 4) are mostly distributed from around 0.03 to 0.27 SO2 ratio. Fixed acidity, which does not have a big impact on wine quality distributions, was chosen as the y-axis to make the wine quality distributions across the SO2 ratio stand out.
At first, I struggled with what type of plot to choose for data exploration. Because wine quality is a discrete variable, my usual go-to scatterplot was unusable. When I tried boxplots, however, it was very easy to recognize patterns in the data. Another difficulty was that very few wines received ratings of 9 or 3, so it was hard to draw conclusions about how these wines differ from other wines. This is why I decided to merge the ratings into three groups. Luckily, once merged, there were recognizable patterns among the groups of wines of different ratings.
The most surprising finding was that high quality wines tend to have a higher alcohol concentration, which I find distasteful, and less of the flavorful ingredients like sugar and salt.
Although I have tried many features to discover underlying patterns in the data, I must admit that the analysis is not comprehensive. One thing to try in the future is fitting a multiple regression to the data. Because a multiple regression’s coefficients describe how one variable affects the variable of interest (in this case wine rating) when all other variables are held equal, the regression can offer new insight. However, to know whether the coefficients can be trusted, one would need to check the accuracy of the model and make sure it is good enough. Doing this, however, may be outside the scope of exploratory data analysis.
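To make the "held equal" idea concrete, here is a rough sketch using `lm()` on synthetic data (not the wine data). The variables `x1` and `x2` are hypothetical stand-ins for two correlated predictors, in the same way that alcohol and sugar are correlated in this data set, and the true coefficients are chosen by me so the result can be checked.

```r
set.seed(42)

# Synthetic data (NOT the wine data): the response depends on x1 and x2
# with known coefficients, and x1 and x2 are correlated with each other.
n  <- 500
x1 <- rnorm(n)
x2 <- -0.6 * x1 + rnorm(n, sd = 0.8)
y  <- 2 + 0.5 * x1 - 0.3 * x2 + rnorm(n, sd = 0.1)
toy <- data.frame(y = y, x1 = x1, x2 = x2)

# Simple regression: the x1 slope absorbs part of the x2 effect.
fit_simple <- lm(y ~ x1, data = toy)

# Multiple regression: each slope is the effect with the other held fixed.
fit_multi <- lm(y ~ x1 + x2, data = toy)

print(round(coef(fit_simple), 2))
print(round(coef(fit_multi), 2))
print(round(summary(fit_multi)$r.squared, 3))
```

In the simple fit, the `x1` slope is inflated because `x1` also carries information about `x2`. The multiple regression separates the two, which is the kind of disentangling that would help decide whether alcohol matters on its own.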